Statistical Inference III: Hypothesis Testing and Confidence Intervals
Introduction
Recall the basic setup of frequentist statistical inference under the random sampling framework. Here, we view the observed data \(\boldsymbol x = (x_1, \ldots, x_n)'\) as a realization of the random variables \(\boldsymbol X = (X_1, \ldots, X_n)'\) independently drawn from an unknown, common (identical) distribution \(F\).1 In other words, the data \(\boldsymbol x\) is a realization of a random sample \(\boldsymbol X\) from the population \(F\). The goal of statistical inference is to use the abstraction of random sampling to make statements about \(F\), while quantifying the uncertainty in those statements. Typically, we need to make some starting assumptions about the structure of \(F\) to be able to say anything interesting about it. This is called a statistical model for \(F\). In this post, we will work with the normal sampling model, where we assume that \(F\) is the normal distribution with unknown mean \(\mu\) and unknown variance \(\sigma^2\).
1 The phrasing \(X\) “drawn” from \(F\) is informal shorthand to mean the random variable \(X\) has the distribution \(F\).
Previously, we discussed one form of statistical inference: point estimation. This entailed developing a rule (i.e. an estimator) that used the realized data to produce a best guess (i.e. an estimate) of some parameter of interest \(\theta\) determined by the population \(F\). In point estimation, uncertainty arises from the fact that the estimator is a function of the random sample, and is therefore also random. The uncertainty is quantified by asking the question: “If we repeatedly drew random samples from the population and computed the estimate each time, how would the estimates vary?” The answer to this question was given by the sampling distribution of the estimator.
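This repeated-sampling idea is easy to see in a small simulation. The sketch below assumes a normal population with hypothetical mean 5 and standard deviation 2, and approximates the sampling distribution of the sample mean by drawing many samples:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n = 5.0, 2.0, 25  # illustrative population parameters and sample size

# Draw many random samples and compute the sample mean of each.
# The spread of these estimates traces out the sampling distribution.
estimates = np.array([rng.normal(mu, sigma, size=n).mean() for _ in range(10_000)])

print(estimates.mean())       # close to mu = 5
print(estimates.std(ddof=1))  # close to sigma / sqrt(n) = 0.4
```

The standard deviation of the simulated estimates is close to \(\sigma/\sqrt{n}\), the theoretical standard deviation of the sample mean.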
Hypotheses
Hypothesis testing is a fundamental tool to conduct statistical inference. At its core is the hypothesis: a statement about a scalar parameter of interest \(\theta\) determined by the population \(F\). In a non-parametric model — where \(F\) cannot be fully characterized by a finite set of model parameters — \(\theta\) is some function of \(F\), like \(\mathbb{E}[X]\) or \(\operatorname{Var}(X)\). In a parametric model — where \(F\) is assumed to belong to a family of distributions characterized by a finite number of parameters — \(\theta\) is typically one of the model parameters.2 For example, in the normal sampling model, our hypotheses are usually about \(\mu\) or \(\sigma^2\).
2 It is worth emphasizing that while the parameter of interest is often the model parameters, the two are not always the same.
3 Technically, the null hypothesis can be a set of values as well. However, in this post, we focus on point (null) hypotheses.
Hypothesis testing is formulated in terms of two complementary hypotheses. The null hypothesis \(H_0\) is the hypothesis to be tested, while the alternative hypothesis \(H_1\) is the complement of the null hypothesis. These hypotheses can be defined by how they restrict the parameter space of \(\theta\), denoted \(\Theta\). Specifically, the null is defined by the restriction \(\theta = \theta_0\) for some hypothesized value \(\theta_0\), while the alternative is defined by the set \(\{\theta \in \Theta: \theta \neq \theta_0\}\).3
In this post, we will focus on the problem of testing hypotheses about the population mean \(\mu\) under the normal sampling model. Specifically, our hypotheses are \[ H_0: \mu = \mu_0 \quad \text{and} \quad H_1: \mu \neq \mu_0, \] where \(\mu_0\) is some hypothesized value of \(\mu\). The goal of hypothesis testing is to test the validity of \(H_0\) over \(H_1\) using the random sample. In the next few sections, we will develop the machinery required to do exactly this.
Test Statistic and Critical Region
The outcome of a hypothesis test is a decision: either accept the null or reject the null in favor of the alternative. Thus, we want a decision rule that maps the sample space — the set of all possible realizations of the random sample — into one of these two actions. One way to formulate this rule is as follows. First, we construct a function of the random sample called the test statistic
\[ T: \mathbb{R}^n \rightarrow \mathbb{R}. \] The role of the test statistic is to take any random sample from the population and compress it into a single number that reflects the information most relevant to the hypothesis. Since it is a function of random variables, the test statistic is itself a random variable. Second, we define a critical region \(C\) to be some subset of the range of \(T\). The decision rule can then be framed in terms of the observed realization of the test statistic, \(t\): (i) reject \(H_0\) if \(t \in C\), and (ii) accept \(H_0\) if \(t \notin C\).
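The decision rule can be sketched as a small function. The particular test statistic and critical region below are illustrative placeholders only, chosen to make the sketch runnable:

```python
import numpy as np

def decision_rule(sample, test_statistic, in_critical_region):
    """Map a realized sample to 'reject H0' or 'accept H0'."""
    t = test_statistic(sample)  # compress the sample into a single number
    return "reject H0" if in_critical_region(t) else "accept H0"

# Illustrative placeholders: standardized mean as T, two-sided region {|t| > 1.96}.
T = lambda x: (np.mean(x) - 0.0) / (np.std(x, ddof=1) / np.sqrt(len(x)))
C = lambda t: abs(t) > 1.96

print(decision_rule(np.array([0.1, -0.2, 0.05, 0.3, -0.1]), T, C))
```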
It’s worth pausing here to emphasize two important features of this setup. First, the effectiveness of a hypothesis test depends on the choice of both the test statistic and critical region. If \(T\) does not capture the right information, or if \(C\) is poorly specified, the resulting decision rule will not be very insightful. Second, making incorrect decisions is inevitable in hypothesis testing because of the randomness inherent in the sample. Even with a well-chosen statistic and critical region, it is entirely possible to draw a sample with a mean far from the true population mean \(\mu\) but close to the hypothesized mean \(\mu_0\), or conversely, a sample far from \(\mu_0\) even when \(\mu = \mu_0\). In both situations, we make an incorrect decision purely due to chance.
Thus, in carrying out hypothesis tests, the focus is not on eliminating mistakes entirely, but rather on quantifying and controlling the probability of error. The following section formalizes this idea.
Classical Approach to Hypothesis Testing
Error Probabilities and Power Function
There are two types of incorrect decisions we could make in hypothesis testing. Rejecting the null when it is actually true is called a Type I error. Accepting the null when it is actually false is called a Type II error.
\[ \begin{array}{c|c|c} & \textbf{Accept $H_0$} & \textbf{Reject $H_0$} \\ \hline \textbf{$H_0$ True} & \text{Correct Decision} & \text{Type I Error} \\ \hline \textbf{$H_1$ True} & \text{Type II Error} & \text{Correct Decision} \\ \end{array} \]
To build the vocabulary and notation required to talk about the probability of making these two errors, we need to introduce the power function of a hypothesis test. This is the probability of rejecting the null \(H_0\) under some population distribution \(F\), and is denoted \[ \pi(F) = \mathbb{P}[\text{Reject } H_0 \mid F] = \mathbb{P}[T \in C \mid F]. \]
The power function formalizes the source of randomness in hypothesis tests. Our data is random because they are realizations of a random sample drawn from the distribution \(F\). Since the test statistic is computed from the data, its randomness is also induced by \(F\). Consequently, the decision to accept or reject the null is likewise determined by \(F\). The power function summarizes this chain by expressing the probability of rejection directly as a function of the underlying distribution \(F\).
The probability of making a Type I error is the size of the hypothesis test, and is simply the power function evaluated under the distribution implied by the null, \(F_0\): \[ \mathbb{P}[\text{Reject } H_0 \mid F_0] = \pi(F_0). \]
The power of the hypothesis test is the complement of the probability of making a Type II error, and is given by the power function evaluated under the distribution implied by the alternative, \(F_1\): \[ 1 - \mathbb{P}[\text{Accept } H_0 \mid F_1] = \mathbb{P}[\text{Reject } H_0 \mid F_1] = \pi(F_1). \]
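A simulation makes the size and power concrete. The sketch below assumes a normal population with known variance 1, the standardized sample mean as the test statistic, and the two-sided critical region \(\{t : |t| > 1.96\}\); the true means plugged in are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)
n, mu0, c = 20, 0.0, 1.96  # sample size, hypothesized mean, critical value

def rejection_rate(mu_true, reps=20_000):
    """Fraction of samples from N(mu_true, 1) whose statistic lands in C."""
    x = rng.normal(mu_true, 1.0, size=(reps, n))
    t = (x.mean(axis=1) - mu0) / (1.0 / np.sqrt(n))  # known sigma = 1
    return np.mean(np.abs(t) > c)

print(rejection_rate(0.0))  # size: close to 0.05
print(rejection_rate(0.7))  # power against mu = 0.7: well above 0.05
```

Evaluating the rejection rate at \(\mu = \mu_0\) approximates the size \(\pi(F_0)\); evaluating it at any other mean approximates the power \(\pi(F_1)\) against that alternative.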
Classical Approach to the Fundamental Tradeoff
Notice that we could mechanically decrease the probability of a Type I error by reducing the size of \(C\). However, this would in turn increase the probability of a Type II error. Conversely, increasing the size of \(C\) would decrease the probability of a Type II error, but increase the probability of a Type I error. This is the fundamental tradeoff in hypothesis testing: reducing the probability of one type of error comes at the cost of increasing the probability of the other.
The classical approach to hypothesis testing is to bound the size of the test at some pre-specified level \(\alpha \in (0,1)\), called the significance level. More formally, this means choosing the critical region \(C\) so that \[ \mathbb{P}[T \in C \mid F_0] \leq \alpha. \]
To construct such a critical region, we need to know the distribution of the test statistic \(T\) under the null hypothesis \(H_0\), called the null sampling distribution:
\[ G_0(t) = \mathbb{P}[T \leq t \mid F_0]. \]
One reason why we focus on bounding the probability of Type I errors instead of Type II errors is that the null sampling distribution is typically easier to derive than the distribution of \(T\) under the various alternative hypotheses.
Critical Values, One-Sided and Two-Sided Tests
For most classical test statistics, the null sampling distribution is unimodal and symmetric, with density that steadily decreases as we move away from the center. This implies that the least likely realizations of the test statistic under the null lie in the tails of the distribution. In such cases, the critical region can be fully characterized by a single critical value \(c\). For a one-sided test, the critical region is \[
C = \{t: t > c\} \quad \text{or} \quad C = \{t: t < c\}.
\] For a two-sided test, the critical region is
\[
C = \{t: |t| > c\}.
\]
The critical value \(c\) is chosen so that the size of the test equals the significance level \(\alpha\). By placing the critical region in the tail (or tails) of the null sampling distribution, where realizations are least likely under the null, the test rejects precisely for the values of the statistic most at odds with \(H_0\), while its size remains controlled at exactly \(\alpha\).
Two-Sided Test of the Mean in the Normal Sampling Model
We have now developed all the terminology needed to carry out the hypothesis test of the mean under the normal sampling model.
t-Statistic
The first step to any hypothesis test is to choose a test statistic. A common choice is the t-ratio or t-statistic. For some estimator \(\hat\theta\) of the parameter of interest \(\theta\), the t-statistic is defined as
\[
T = \frac{\hat\theta - \theta_0}{SE(\hat\theta)},
\] where \(\theta_0\) is a hypothesized value of \(\theta\) and \(SE(\hat\theta)\) is the standard error of \(\hat\theta\) — i.e., an estimator of the standard deviation of \(\hat\theta\). In words, the realization of this statistic measures how many standard errors away the estimate \(\hat\theta(\boldsymbol x)\) is from the hypothesized value \(\theta_0\). Thus, the realization of the t-statistic has an intuitive interpretation: a large magnitude indicates that the estimate is far from the hypothesized value in the realized sample, while small values indicate that the estimate is close to the hypothesized value.
In the context of testing the mean \(\mu\) in the normal sampling model, the sample mean \(\bar X\) is a natural estimator. Recall that the variance of the sample mean \(\bar X\) in the normal sampling model is \(\sigma^2 / n\). However, since \(\sigma^2\) is unknown, we use the bias-corrected variance estimator \[ s^2 = \frac{1}{n-1} \sum_{i=1}^n (X_i - \bar X)^2 \] to estimate \(\sigma^2\). The standard error of \(\bar X\) is then given by the estimator \[ SE(\bar X) = \frac{s}{\sqrt{n}}. \]
Thus, the t-statistic for testing the mean \(\mu\) in the normal sampling model is given by
\[ T = \frac{\bar X - \mu_0}{s / \sqrt{n}}. \]
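As a quick sketch, this statistic can be computed directly and checked against scipy's one-sample t-test; the data below are simulated purely for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = rng.normal(5.2, 2.0, size=30)  # hypothetical data
mu0 = 5.0                          # hypothesized mean

n = len(x)
s = x.std(ddof=1)                  # bias-corrected standard deviation
t_manual = (x.mean() - mu0) / (s / np.sqrt(n))

# scipy's one-sample t-test computes the same statistic.
t_scipy, p = stats.ttest_1samp(x, popmean=mu0)
print(t_manual, t_scipy)           # the two values agree
```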
Null Sampling Distribution
To construct the critical region, we need to derive the null sampling distribution of the t-statistic. A special property of the t-statistic in the normal sampling model is that it is a pivotal quantity, meaning that its null sampling distribution does not depend on unknown parameters.
If \(\sigma^2\) were known, we could use \(\sigma\) in place of \(s\), and the resulting statistic would have the standard normal distribution under the null. With \(\sigma^2\) unknown, the t-statistic above instead has the Student’s t-distribution with \(n-1\) degrees of freedom under the null. An important property of the Student’s t-distribution is that it converges to the standard normal distribution as the sample size \(n\) increases.
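A quick check of this convergence, comparing upper 2.5% quantiles of the t-distribution against the standard normal:

```python
from scipy import stats

# The 97.5% quantile of Student's t approaches the standard normal value 1.96
# as the degrees of freedom grow.
for df in (5, 30, 100, 1000):
    print(df, round(stats.t.ppf(0.975, df=df), 4))
print("normal", round(stats.norm.ppf(0.975), 4))
```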
Choosing the Critical Value
For the two-sided test at significance level \(\alpha\), the critical value \(c\) is chosen so that \(\mathbb{P}[|T| > c \mid F_0] = \alpha\). Because the Student’s t-distribution is symmetric about zero, this amounts to splitting the probability \(\alpha\) evenly between the two tails, so \(c\) is the \(1 - \alpha/2\) quantile of the t-distribution with \(n - 1\) degrees of freedom: \[ c = G_0^{-1}(1 - \alpha/2). \] For example, with \(n = 20\) and \(\alpha = 0.05\), we have \(c \approx 2.09\). The decision rule is then to reject \(H_0\) if \(|t| > c\), and accept \(H_0\) otherwise.
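As a sketch with illustrative values \(n = 20\) and \(\alpha = 0.05\), the two-sided critical value is the \(1 - \alpha/2\) quantile of the t-distribution with \(n - 1\) degrees of freedom:

```python
from scipy import stats

n, alpha = 20, 0.05                        # illustrative sample size and level
c = stats.t.ppf(1 - alpha / 2, df=n - 1)   # 1 - alpha/2 quantile of t(n - 1)
print(round(c, 3))                         # 2.093
```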
p-Values
The critical-value approach reports only a binary decision: reject or accept. The p-value conveys more information. It is the probability, computed under the null sampling distribution, of drawing a test statistic at least as extreme as the one actually observed. For the two-sided t-test with realized statistic \(t\), \[ p = \mathbb{P}[|T| \geq |t| \mid F_0] = 2(1 - G_0(|t|)). \] The rule “reject \(H_0\) if \(p < \alpha\)” yields the same decision as the critical-value rule, but the p-value additionally reports the smallest significance level at which the null would be rejected given the realized sample.
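As a sketch with simulated data, the two-sided p-value can be computed from the CDF of the t-distribution and checked against scipy’s built-in test:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
x = rng.normal(5.5, 2.0, size=30)  # hypothetical data
mu0 = 5.0                          # hypothesized mean

n = len(x)
t = (x.mean() - mu0) / (x.std(ddof=1) / np.sqrt(n))
p_manual = 2 * (1 - stats.t.cdf(abs(t), df=n - 1))

# Matches scipy's one-sample two-sided t-test.
_, p_scipy = stats.ttest_1samp(x, popmean=mu0)
print(p_manual, p_scipy)
```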
Confidence Intervals
A confidence interval inverts the hypothesis test: it collects every hypothesized value \(\mu_0\) that the two-sided test would not reject at level \(\alpha\). Solving \(|t| \leq c\) for \(\mu_0\) gives the interval \[ \left[\, \bar X - c \, \frac{s}{\sqrt{n}},\; \bar X + c \, \frac{s}{\sqrt{n}} \,\right], \] where \(c\) is the \(1 - \alpha/2\) quantile of the t-distribution with \(n - 1\) degrees of freedom. Because its endpoints are functions of the random sample, the interval itself is random. By construction, it covers the true mean \(\mu\) with probability \(1 - \alpha\) over repeated sampling, which is why it is called a \(1 - \alpha\) (e.g., 95%) confidence interval.
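A sketch of computing such an interval, with simulated data and illustrative values throughout:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
x = rng.normal(5.0, 2.0, size=30)  # hypothetical data
alpha = 0.05                       # illustrative significance level

n = len(x)
c = stats.t.ppf(1 - alpha / 2, df=n - 1)     # two-sided critical value
half_width = c * x.std(ddof=1) / np.sqrt(n)  # c * s / sqrt(n)
lo, hi = x.mean() - half_width, x.mean() + half_width
print((round(lo, 3), round(hi, 3)))          # 95% confidence interval for mu
```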